Improve Absent alert duration #1348

Merged
metalmatze merged 1 commit into release-0.8 from absent-duration
Dec 4, 2024


@metalmatze (Member) commented Dec 4, 2024

This PR introduces the AbsentDuration method.
AbsentDuration calculates how long a metric has to be absent before the absent alert fires. The idea is as follows:
Use the most critical of the multi burn rate alerts. For that alert to fire, both the short AND the long window have to be above the threshold. The long window takes the longest to get there. Assuming absence of the metric means a 100% error rate, the time it takes to fire is the time the long window needs to go above the threshold (factor * error budget, where the error budget is 1 - objective). Finally, we add the same "for" duration that we add to the multi burn rate alerts.

Because the calculation takes the windows and the objective into account, the absent alert's duration now depends on your SLO's objective and window.

Looking at the tests, the duration increased significantly for most absent alerts.

@metalmatze
Member Author

This should make absent alerts a lot less flaky.

Please take a look, @brancz.
FYI, @kakkoyun, @lilic, @jzelinskie.

I'll release this as v0.8.1 once merged.


@brancz brancz left a comment

Nice! lgtm

@metalmatze metalmatze merged commit 10766e0 into release-0.8 Dec 4, 2024
10 checks passed
@metalmatze metalmatze deleted the absent-duration branch December 4, 2024 15:32